{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TimeWeightedVectorStoreRetriever\n",
    "\n",
    "- Author: [Youngjun Cho](https://github.com/choincnp)\n",
    "- Peer Review : []()\n",
    "- Proofread : [Juni Lee](https://www.linkedin.com/in/ee-juni)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/10-Retriever/09-TimeWeightedVectorStoreRetriever.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/10-Retriever/09-TimeWeightedVectorStoreRetriever.ipynb)\n",
    "## Overview\n",
    "\n",
    "```TimeWeightedVectorStoreRetriever``` is a retriever that uses a combination of semantic similarity and a time decay. \n",
    "\n",
    "By doing so, it considers both the \" **freshness** \" and \" **relevance** \" of the documents or data in its results.\n",
    "\n",
    "The algorithm for scoring them is:  \n",
    "\n",
    "> $\\text{semantic\\_similarity} + (1.0 - \\text{decay\\_rate})^{hours\\_passed}$\n",
    "\n",
    "- ```semantic_similarity``` indicates the semantic similarity between documents or data.\n",
    "- ```decay_rate``` represents the ratio at which the score decreases over time.\n",
    "- ```hours_passed``` is the number of hours elapsed since the object was last accessed.\n",
    "\n",
    "The key feature of this approach is that it evaluates the “ **freshness of information** ” based on the last time the object was accessed. \n",
    "\n",
    "In other words, **objects that are accessed frequently maintain a higher score** over time, increasing the likelihood that **frequently used or important information will appear near the top** of search results. This allows the retriever to provide dynamic results that account for both recency and relevance.\n",
    "\n",
    "Importantly, in this context, ```decay_rate``` is determined by the **time since the object was last accessed** , not since it was created. \n",
    "\n",
    "Hence, any objects that are accessed frequently remain \"fresh.\"\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Low decay_rate](#low-decay_rate)\n",
    "- [High decay_rate](#high-decay_rate)\n",
    "- [Summary of the decay_rate](#summary-of-the-decay_rate)\n",
    "- [Testing with Virtual time](#testing-with-virtual-time)\n",
    "\n",
    "### References\n",
    "\n",
    "- [Time-weighted vector store retriever](https://python.langchain.com/docs/how_to/time_weighted_vectorstore/)\n",
    "- [TimeWeightedVectorStoreRetriever](https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.time_weighted_retriever.TimeWeightedVectorStoreRetriever.html)\n",
    "- [mock_now](https://python.langchain.com/api_reference/core/utils/langchain_core.utils.utils.mock_now.html)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain\",\n",
    "        \"langchain_core\",\n",
    "        \"langchain_community\",\n",
    "        \"langchain_openai\",\n",
    "        \"faiss-cpu\"\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Environment variables have been set successfully.\n"
     ]
    }
   ],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"TimeWeightedVectorStoreRetriever\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n",
    "\n",
    "[Note] This is not necessary if you've already set the required API keys in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load API keys from .env file\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Low decay_rate\n",
    "\n",
    "- A low ```decay_rate``` (In this example, we'll set it to an extreme value close to 0) means that **memories are retained for a longer period** .\n",
    "\n",
    "- A ```decay_rate``` of **0 means that memories are never forgotten** , which makes this retriever equivalent to a vector lookup."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first initialize the ```TimeWeightedVectorStoreRetriever``` with a very small ```decay_rate``` and ```k=1``` (where ```k``` is the number of vectors to retrieve)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime, timedelta\n",
    "\n",
    "import faiss\n",
    "from langchain.docstore import InMemoryDocstore\n",
    "from langchain.retrievers import TimeWeightedVectorStoreRetriever\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.documents import Document\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "# Define the embedding model.\n",
    "embeddings_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
    "\n",
    "# Initialize vector store empty.\n",
    "embedding_size = 1536\n",
    "index = faiss.IndexFlatL2(embedding_size)\n",
    "vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})\n",
    "\n",
    "# Initialize the time-weighted vector store retriever. (Here, we'll apply a very small decay_rate)\n",
    "retriever = TimeWeightedVectorStoreRetriever(\n",
    "    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add a simple example data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['58449575-d54f-47dc-9a76-806eccb850f3']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the date of yesterday\n",
    "yesterday = datetime.now() - timedelta(days=1)\n",
    "\n",
    "retriever.add_documents(\n",
    "    # Add a document with yesterday's date in the metadata\n",
    "    [\n",
    "        Document(\n",
    "            page_content=\"Please subscribe to LangChain Youtube.\",\n",
    "            metadata={\"last_accessed_at\": yesterday},\n",
    "        )\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Add another document. No metadata is specified here.\n",
    "retriever.add_documents(\n",
    "    [Document(page_content=\"Will you subscribe to LangChain Youtube? Please!\")]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'last_accessed_at': datetime.datetime(2025, 1, 7, 10, 19, 14, 305565), 'created_at': datetime.datetime(2025, 1, 7, 10, 19, 2, 632517), 'buffer_idx': 0}, page_content='Please subscribe to Langchain Youtube.')]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Invoke the retriever to search\n",
    "retriever.invoke(\"LangChain Youtube\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The document \"Please subscribe to LangChain Youtube\" appears first because it is the **most salient** .\n",
    "\n",
    "- Since the ```decay_rate``` is close to 0, the document is still considered **recent** ."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## High decay_rate\n",
    "\n",
    "When a high ```decay_rate``` is used (e.g., 0.9999...), the **recency score** rapidly converges to 0.\n",
    "\n",
    "If this value were set to 1, all objects would end up with a ```recency``` value of 0, resulting in the same outcome as a standard vector lookup. \n",
    "\n",
    "Initialize the retriever using ```TimeWeightedVectorStoreRetriever``` , setting the ```decay_rate``` to 0.999 to adjust the time-based weight decay rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the embedding model.\n",
    "embeddings_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
    "\n",
    "# Initialize vector store empty.\n",
    "embedding_size = 1536\n",
    "index = faiss.IndexFlatL2(embedding_size)\n",
    "vectorstore = FAISS(embeddings_model, index, InMemoryDocstore({}), {})\n",
    "\n",
    "# Initialize the time-weighted vector store retriever.\n",
    "retriever = TimeWeightedVectorStoreRetriever(\n",
    "    vectorstore=vectorstore, decay_rate=0.999, k=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add new documents again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['68d6e6ce-8ab7-4c40-aaf9-1d852eedcb49']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Calculate the date of yesterday\n",
    "yesterday = datetime.now() - timedelta(days=1)\n",
    "retriever.add_documents(\n",
    "    [\n",
    "        Document(\n",
    "            page_content=\"Please subscribe to LangChain Youtube.\",\n",
    "            metadata={\"last_accessed_at\": yesterday},\n",
    "        )\n",
    "    ]\n",
    ")\n",
    "retriever.add_documents(\n",
    "    [Document(page_content=\"Will you subscribe to LangChain Youtube? Please!\")]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'last_accessed_at': datetime.datetime(2025, 1, 7, 10, 29, 2, 687697), 'created_at': datetime.datetime(2025, 1, 7, 10, 28, 37, 213151), 'buffer_idx': 1}, page_content='Will you subscribe to Langchain Youtube? Please!')]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Invoke the retriever to search\n",
    "retriever.invoke(\"LangChain Youtube\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, when you invoke the retriever, \"Will you subscribe to LangChain Youtube? Please!\" is returned first.\n",
    "- Because ```decay_rate``` is high (close to 1), older documents (like the one from yesterday) are nearly forgotten."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the decay_rate\n",
    "\n",
    "- when the ```decay_rate``` is set to a very small value, such as 0.000001:\n",
    "    - The decay rate (i.e., the rate at which information is forgotten) is extremely low, so information is hardly forgotten.\n",
    "    - As a result, there is almost **no difference in time-based weights between more or less recently accessed information** . In this case, similarity scores are given higher priority.\n",
    "\n",
    "- When the ```decay_rate``` is set close to 1, such as 0.999:\n",
    "    - The decay rate is very high, so most of the recently unaccessed information is almost completely forgotten.\n",
    "    - As a result, in such cases, higher scores are given to more recently accessed information.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Testing with Virtual Time\n",
    "\n",
    "```LangChain``` provides some utilities that allow you to test time-based components by mocking the current time.\n",
    "\n",
    "- The ```mock_now``` function is a utility function provided by LangChain, used to mock the current time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[**NOTE**]  \n",
    "Inside the with statement, all ```datetime.now``` calls return the **mocked time** . Once you **exit** the with block, it reverts back to the **original time** ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "before mocking\n",
      "now is: 2025-01-07 14:06:37.961348\n",
      "\n",
      "with mocking\n",
      "now is: 2025-01-07 00:00:00\n",
      "\n",
      "without mock_now block\n",
      "now is: 2025-01-07 14:06:37.961571\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import datetime\n",
    "from langchain_core.utils import mock_now\n",
    "\n",
    "# Define a function that print current time\n",
    "def print_current_time():\n",
    "    now = datetime.datetime.now()\n",
    "    print(f\"now is: {now}\\n\")\n",
    "\n",
    "# Print the current time\n",
    "print(\"before mocking\")\n",
    "print_current_time()\n",
    "\n",
    "# Set the current time to a specific point in time\n",
    "with mock_now(datetime.datetime(2025, 1, 7, 00, 00)):\n",
    "    print(\"with mocking\")\n",
    "    print_current_time()\n",
    "\n",
    "# Print the new current time(without mock_now block)\n",
    "print(\"without mock_now block\")\n",
    "print_current_time()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By using the ```mock_now``` function, you can shift the current time and see how the search results change.\n",
    "- This helps you find an appropriate ```decay_rate``` for your use case.\n",
    "\n",
    "**[Note]**  \n",
    "\n",
    "If you set the time too far in the past, an error might occur during ```decay_rate``` calculations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(metadata={'last_accessed_at': MockDateTime(2025, 1, 7, 0, 0), 'created_at': datetime.datetime(2025, 1, 7, 10, 28, 37, 213151), 'buffer_idx': 1}, page_content='Will you subscribe to Langchain Youtube? Please!')]\n"
     ]
    }
   ],
   "source": [
    "# Example usage changing the current time for testing.\n",
    "with mock_now(datetime.datetime(2025, 1, 7, 00, 00)):\n",
    "    # Execute a search in this simulated timeline.\n",
    "    print(retriever.invoke(\"Langchain Youtube\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
